{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-c97b01e39f10fa7f",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "# Exercise 11) Policy Gradients"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-e732f039d63130ba",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "In this exercise we will have a look at policy gradients. \n",
    "The theory of policy gradients applies to function approximators that decide on which action to choose. \n",
    "The function approximators we met in the past were employed to estimate the (action) value function. \n",
    "Since their task was to judge the quality of the current situation they are often referred to as \"critics\". \n",
    "In contrary, we can also use a function approximator to directly choose an action; we call these \"actors\". \n",
    "Why should we do that if we made it work with nothing more than a critic? \n",
    "Because this will finally allow us to make use of contiuous action spaces! Eureka!\n",
    "\n",
    "In this exercise we will use a new `gym` environment `LunarLanderContinuous-v3`.\n",
    "To run this environment please make sure to have `Box2D` installed: `pip install Box2D`.\n",
    "\n",
    "![](https://images.squarespace-cdn.com/content/v1/59e0d6f0197aea1a0abc8016/1507938542206-41S6K9T97YETKEHP0PQF/ke17ZwdGBToddI8pDm48kMR1yAHb8bPoH1-OdajP2rZZw-zPPgdn4jUwVcJE1ZvWQUxwkmyExglNqGp0IvTJZUJFbgE-7XRK3dMEBRBhUpyDg3tXaPHS4cFkn9Bnm-emI0BDr_E-XKAFKqWrx68ZVlLyhCgVi_FJvVMH7mHrc18/lunar_lander_success_example.gif?format=500w)\n",
    "\n",
    "Source: https://www.billyvreeland.com/portfolio/2017/1013/solving-openai-gym-nm4yz\n",
    "\n",
    "The main task is to land the LunarLander in the landing zone.\n",
    "An accident-free landing is defined by both legs coming into  ground contact with moderate velocity.\n",
    "We are dealing with a continuous state and action space as defined below.\n",
    "Please notice that the control functions for main and side engines contain a dead zone in which the engines are inactive.\n",
    "The reward is mainly defined depending on whether the landing procedure is successful (+100) or not (-100).\n",
    "Firing the main engine gives a small negative reward. \n",
    "The problem is solved if a return of at least 200 is reached. \n",
    "For more information see https://gymnasium.farama.org/environments/box2d/lunar_lander/.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-388eaa3e36f18307",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "\n",
    "\\begin{align}\n",
    "\\text{state}&=\n",
    "\\begin{bmatrix}\n",
    "p_x\\\\\n",
    "p_y\\\\\n",
    "v_x\\\\\n",
    "v_y\\\\\n",
    "\\varphi\\\\\n",
    "20 \\, \\omega\\\\\n",
    "1 \\text{ if left leg has ground contact, else } 0\\\\\n",
    "1 \\text{ if right leg has ground contact, else } 0\\\\\n",
    "\\end{bmatrix}\n",
    "\\\\\n",
    "\\text{action}&=\n",
    "\\begin{bmatrix}\n",
    "\\text{main engine: } [-1, 0] \\rightarrow \\text{off}, ]0, 1] \\rightarrow [50 \\, \\%, 100 \\, \\%] \\text{ of available power}\\\\\n",
    "\\text{side engines: } [-1, -0.5] \\rightarrow [50 \\, \\%, 100 \\, \\%] \\text{ of available left engine power}, [0.5, 1] \\rightarrow [50 \\, \\%, 100 \\, \\%] \\text{ of available right engine power}\\\\\n",
    "\\end{bmatrix}\n",
    "\\end{align}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-cb8f30d3aea6982a",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import gymnasium as gym\n",
    "from tqdm.notebook import tqdm\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.kernel_approximation import RBFSampler\n",
    "import sklearn.pipeline\n",
    "import sklearn.preprocessing\n",
    "\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-8986527608544bb1",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "env = gym.make(\"LunarLander-v3\",\n",
    "                continuous=True,\n",
    "              render_mode = \"human\")\n",
    "\n",
    "state = env.reset()\n",
    "while True:\n",
    "    state, reward, terminated, truncated, _ = env.step(env.action_space.sample())\n",
    "    done = terminated or truncated\n",
    "\n",
    "    if done:\n",
    "        break\n",
    "env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-a5be36d9e2de2982",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## 1) Monte Carlo Policy Gradient\n",
    "Write a REINFORCE algorithm."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-5fef942298a07621",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Execute the following cell to fit the featurizer using RBFSampler, like already learned in the last exercises. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-6a414b0a2359c70d",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "pycharm": {
     "is_executing": true
    }
   },
   "outputs": [],
   "source": [
    "env = gym.make(\"LunarLander-v3\",\n",
    "                continuous=True)\n",
    "state_array = []\n",
    "state = env.reset()\n",
    "\n",
    "for i in tqdm(range(1000)):\n",
    "    state, _ = env.reset()\n",
    "    while True:\n",
    "        state, reward, terminated, truncated, _ = env.step(env.action_space.sample())\n",
    "        done = terminated or truncated\n",
    "        state_array.append(state)\n",
    "\n",
    "        if done:\n",
    "            break\n",
    "\n",
    "state_array = np.array(state_array)\n",
    "\n",
    "featurizer = sklearn.pipeline.make_pipeline(\n",
    "    sklearn.preprocessing.StandardScaler(),\n",
    "    sklearn.pipeline.FeatureUnion([\n",
    "    (\"rbf0\", RBFSampler(gamma=5.0, n_components = 1000)),\n",
    "    (\"rbf1\", RBFSampler(gamma=2.0, n_components = 1000)),\n",
    "    (\"rbf2\", RBFSampler(gamma=1.0, n_components = 1000)),\n",
    "    (\"rbf3\", RBFSampler(gamma=0.5, n_components = 1000)),\n",
    "    ]),\n",
    "    sklearn.preprocessing.StandardScaler()\n",
    ")\n",
    "\n",
    "_ = featurizer.fit(state_array)\n",
    "env.close()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-3c04ca9fbfe07902",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Use the following cell to define the function approximators for the policy.\n",
    "As seen in Algo.12.1 we need to calculate $\\nabla_\\theta \\mathrm{ln}\\,\\pi(u_k | x_k, \\theta)$.\n",
    "$\\pi$ is herein defined as the normal distribution : \n",
    "\\begin{align}\n",
    "\\pi(u_k | x_k, \\theta) & = \\frac{\\mathrm{exp} \\left( {-\\frac{1}{2} (u_k - \\mu_\\theta(x_k))^\\mathrm{T} \\mathbf{\\Sigma}^{-1}_\\theta(x_k) (u_k - \\mu_\\theta(x_k))} \\right)}{\\sqrt{(2\\pi)^p \\mathrm{det}(\\mathbf{\\Sigma}_\\theta(x_k))}},\\\\\n",
    "\\text{with}\\hspace{1em} p & = \\mathrm{dim}(u_k).\n",
    "\\end{align}\n",
    "\n",
    "Extend `loglikelyhoodGaussian` such that it returns $\\mathrm{ln}\\,\\pi(u_k | x_k, \\theta)$! \n",
    "Use the numpy equivalent `PyTorch` functions (e.g. `torch.linalg.inv()`).\n",
    "`PyTorch` functions are differentiable and can therefore be  used to calculate $\\nabla_\\theta \\mathrm{ln}\\,\\pi(u_k | x_k, \\theta)$.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class PolicyNetwork(nn.Module):\n",
    "\n",
    "    def __init__(self, input_dim, action_space_dim):\n",
    "        super(PolicyNetwork, self).__init__()\n",
    "        self.fc1 = nn.Linear(input_dim, 400)\n",
    "        self.fc2 = nn.Linear(400, 400)\n",
    "        self.fc3_mu = nn.Linear(400, action_space_dim)\n",
    "        self.fc3_sigma = nn.Linear(400, action_space_dim)\n",
    "\n",
    "    def forward(self, x):\n",
    "        \"\"\"Predict the parameters of the stochastic distribution from which the action\n",
    "        is sampled.\n",
    "        \"\"\"\n",
    "        x = F.leaky_relu(self.fc1(x), 0.1)\n",
    "        x = F.leaky_relu(self.fc2(x), 0.1)\n",
    "\n",
    "        mu_out = self.fc3_mu(x)\n",
    "        mu_out = torch.clip(mu_out, -1, 1)[0]\n",
    "\n",
    "        sigma_out = F.softplus(self.fc3_sigma(x))\n",
    "        sigma_out = torch.mm(sigma_out, torch.tensor([[0.01, 0], [0, 0.1]]))\n",
    "        sigma_out = torch.clip(sigma_out, 1e-4, 1)\n",
    "        sigma_out = torch.diag_embed(sigma_out[0])\n",
    "\n",
    "        return mu_out, sigma_out\n",
    "\n",
    "\n",
    "def loglikelihood_gaussian(u, arg_mu, arg_sigma):\n",
    "    \"\"\"Evaluate the loglikelihood of the multivariate gaussian.\n",
    "    \n",
    "    Args:\n",
    "        u: Where to evaluate the loglikelihood of the gaussian. In our case, this is\n",
    "            the action whose likelihood we are evaluating.\n",
    "        arg_mu: The mean vector of the gaussian distribution\n",
    "        arg_sigma: The covariance matrix of the gaussian distribution\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "\n",
    "    return log_likelihood.squeeze()  # Ensure it's a scalar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-451a7f093f5186bb",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "env = gym.make(\"LunarLander-v3\", continuous=True)\n",
    "\n",
    "state = np.reshape(env.reset()[0], (1, -1))\n",
    "feature_state = featurizer.transform(state)\n",
    "input_dim = feature_state.shape[1]\n",
    "action_space_dim = len(env.action_space.sample())\n",
    "\n",
    "policy = PolicyNetwork(input_dim, action_space_dim)\n",
    "\n",
    "# Regularization; downscaling of the network parameters to prevent divergence\n",
    "with torch.no_grad():\n",
    "    for param in policy.parameters():\n",
    "        param.mul_(0.4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-ed46e45053d3ffe3",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Use the following template to write a REINFORCE algorithm.\n",
    "This time the Adam (adaptive moment estimation) optimizer is used, which is an enhanced SGD optimizer.\n",
    "For more information on the optimizer, see https://arxiv.org/abs/1412.6980."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def interact(state, policy, deterministic, featurizer):\n",
    "    \"\"\"Interact with the environment to get to the next state.\n",
    "    \n",
    "    Args:\n",
    "        state: The current state of the environment\n",
    "        policy: The model that returns the mean and covariance matrix of the policy distribution\n",
    "        deterministic: Whether to choose actions randomly or use the mean\n",
    "        featurizer: The featurizer for state preprocessing\n",
    "\n",
    "    Returns:\n",
    "        next_state: The following state of the environment after the interaction\n",
    "        reward: The reward for the interaction\n",
    "        done: Whether the current episode is finished\n",
    "        action: The chosen action\n",
    "        mu: The mean vector of the policy distribution\n",
    "        sigma: The covariance matrix of the policy distribution\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "\n",
    "    return next_state, reward, done, action, mu, sigma\n",
    "\n",
    "\n",
    "def gather_experience(policy, featurizer, max_episode_len=1000):\n",
    "    \"\"\"Simulates a full episode and returns the gathered data.\n",
    "\n",
    "    Args:\n",
    "        policy: The model that returns the mean and covariance matrix of the policy distribution\n",
    "        featurizer: The featurizer for state preprocessing\n",
    "        max_episode_len: The number of steps at which the episode is terminated automatically\n",
    "\n",
    "    Returns:\n",
    "        states: The states visited in the episode\n",
    "        actions: The actions applied in the epsiode\n",
    "        rewards: The rewards gathered in the episode\n",
    "        probs_log: The loglikelihood values for the chosen actions\n",
    "        accumulated_rewards: The sum of rewards over the episode\n",
    "    \"\"\"\n",
    "    accumulated_rewards = 0\n",
    "\n",
    "    states = []\n",
    "    actions = []\n",
    "    rewards = []\n",
    "    probs_log = []\n",
    "\n",
    "    state, _ = env.reset()\n",
    "    state = torch.tensor(state.reshape(1, -1), dtype=torch.float32)\n",
    "\n",
    "    for _ in range(max_episode_len):  # Limiting each episode to a maximum length\n",
    "        # YOUR CODE HERE\n",
    "        raise NotImplementedError()\n",
    "\n",
    "        if done:\n",
    "            break\n",
    "    \n",
    "    return states, actions, rewards, probs_log, accumulated_rewards\n",
    "\n",
    "\n",
    "def compute_returns(states, rewards, gamma):\n",
    "    \"\"\"Compute the returns based on the gathered rewards.\"\"\"\n",
    "    g_returns = []\n",
    "    for t in range(len(states)):\n",
    "        # YOUR CODE HERE\n",
    "        raise NotImplementedError()\n",
    "    return torch.tensor(g_returns, dtype=torch.float32)\n",
    "\n",
    "\n",
    "def learn(states, rewards, gamma, probs_log, optimizer):\n",
    "    \"\"\"Learn from the gathered experience.\"\"\"\n",
    "\n",
    "    g_returns = compute_returns(states, rewards, gamma)\n",
    "\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "\n",
    "\n",
    "    return policy_loss.item()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-d5b2fb9302cbf856",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "env = gym.make(\"LunarLander-v3\", continuous=True, render_mode=\"rgb_array\")\n",
    "return_history = []\n",
    "\n",
    "alpha_policy = 1e-5\n",
    "optimizer = torch.optim.Adam(policy.parameters(), lr=alpha_policy)\n",
    "gamma = 0.99\n",
    "nb_episodes = 5000\n",
    "\n",
    "for j in tqdm(range(nb_episodes)):\n",
    "\n",
    "    states, actions, rewards, probs_log, accumulated_rewards = gather_experience(policy, featurizer)\n",
    "\n",
    "    _ = learn(states, rewards, gamma, probs_log, optimizer)\n",
    "\n",
    "    if j % 250 == 0 and j != 0:\n",
    "        plt.plot(return_history, label='Return')\n",
    "        plt.plot(pd.Series(return_history).rolling(10).mean(), label='MA')\n",
    "        plt.xlabel('episode')\n",
    "        plt.ylabel('return')\n",
    "        plt.grid(True)\n",
    "        _ = plt.legend()\n",
    "        plt.show()\n",
    "    return_history.append(accumulated_rewards)\n",
    "    \n",
    "env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-10ea9f6de0129a23",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Plot the learning curve of the training process!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-9c960043e350900a",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "plt.plot(return_history, label='Return')\n",
    "plt.plot(pd.Series(return_history, name='reward_history').rolling(10).mean(), label='MA')\n",
    "plt.xlabel('episode')\n",
    "plt.ylabel('return')\n",
    "plt.grid(True)\n",
    "_=plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-88ab1f013663dfea",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "### Execution\n",
    "\n",
    "Use `deterministic` to choose between deterministic execution (applying $\\mu$ directly) or taking the stochastic action by sampling from the normal distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-5eaa6be280fc63bb",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "pycharm": {
     "name": "#%%\n"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "env = gym.make(\"LunarLander-v3\", continuous=True, render_mode=\"human\")\n",
    "\n",
    "deterministic = False\n",
    "\n",
    "for j in tqdm(range(10)):\n",
    "    state, _ = env.reset()\n",
    "    accumulated_rewards = 0\n",
    "\n",
    "    while True:\n",
    "        with torch.no_grad(): \n",
    "            next_state, reward, done, action, mu, sigma = interact(\n",
    "                state = torch.tensor(state.reshape(1, -1), dtype=torch.float32),\n",
    "                policy=policy,\n",
    "                deterministic=deterministic,\n",
    "                featurizer=featurizer\n",
    "            )\n",
    "\n",
    "        accumulated_rewards += reward\n",
    "        state = next_state\n",
    "\n",
    "        if done:\n",
    "            print(f\"Episode {j}, Total Reward: {accumulated_rewards}\")\n",
    "            break\n",
    "\n",
    "env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-2436617277fd8ac5",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 2) Actor-Critic with TD(0) Targets\n",
    "\n",
    "Write an actor-critic (AC) algorithm to land the lander in the landing zone.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-2f6e82b9fd39ab28",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Use the following cell to create two function approximators. One to estimate the state values (critic) and one to decide on the actions to take (actor). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-ee84b13c4cd3bb0b",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "env = gym.make(\"LunarLander-v3\", continuous=True, render_mode=\"rgb_array\")\n",
    "\n",
    "state = np.reshape(env.reset()[0], (1, -1))\n",
    "input_dim = len(featurizer.transform(state)[0])\n",
    "action_space_dim = len(env.action_space.sample())\n",
    "\n",
    "class CriticNetwork(nn.Module):\n",
    "    def __init__(self, input_dim):\n",
    "        super(CriticNetwork, self).__init__()\n",
    "        self.fc1 = nn.Linear(input_dim, 64)\n",
    "        self.fc2 = nn.Linear(64, 64)\n",
    "        self.fc3 = nn.Linear(64, 1)\n",
    "\n",
    "    def forward(self, x):\n",
    "        x = F.leaky_relu(self.fc1(x), 0.1)\n",
    "        x = F.leaky_relu(self.fc2(x), 0.1)\n",
    "        x = self.fc3(x)\n",
    "        return x\n",
    "\n",
    "critic = CriticNetwork(input_dim)\n",
    "\n",
    "for param in critic.parameters():\n",
    "    param.data.mul_(0.2)\n",
    "    \n",
    "actor = PolicyNetwork(input_dim, action_space_dim)\n",
    "\n",
    "for param in actor.parameters():\n",
    "    param.data.mul_(0.4)\n",
    "\n",
    "env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-6a0128a78a488e3b",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Use the following template to write an AC algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def learn_critic(critic, critic_optimizer, state, next_state, done, gamma, featurizer):\n",
    "    \"\"\"Updates the critic network based on the provided states and rewards.\n",
    "\n",
    "    Args:\n",
    "        critic: The critic network to be updated\n",
    "        critic_optimizer: The optimizer for updating the critic network\n",
    "        state: The current state\n",
    "        next_state: The next state\n",
    "        done: Flag indicating whether the episode is finished\n",
    "        gamma: The discount factor for future rewards\n",
    "        featurizer: The featurizer for state preprocessing\n",
    "\n",
    "    Returns:\n",
    "        delta: The temporal difference error for the current state\n",
    "        critic_loss.item(): The value of the critic loss function\n",
    "    \"\"\"\n",
    "\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "\n",
    "    return delta, critic_loss.item()\n",
    "\n",
    "\n",
    "def learn_actor(actor_optimizer, action, mu, sigma, delta, gamma_k):\n",
    "    \"\"\"Updates the actor network based on the provided action, mean, and standard deviation.\n",
    "\n",
    "    Args:\n",
    "        actor_optimizer: The optimizer for updating the actor network\n",
    "        action: The action taken\n",
    "        mu: The mean of the action distribution\n",
    "        sigma: The standard deviation of the action distribution\n",
    "        delta: The temporal difference error for the current state\n",
    "        gamma_k: Importance sampling weight for updating the actor (gamma**k)\n",
    "\n",
    "    Returns:\n",
    "        actor_loss.item(): The value of the actor loss function\n",
    "    \"\"\"\n",
    "\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "\n",
    "    return actor_loss.item()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-fa43dbbddfe825eb",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "env = gym.make(\"LunarLander-v3\", continuous=True, render_mode=\"rgb_array\")\n",
    "return_history = []\n",
    "\n",
    "alpha_critic = 1e-4\n",
    "alpha_actor = 1e-5\n",
    "gamma = 0.99\n",
    "nb_episodes = 500\n",
    "max_episode_len = 500\n",
    "\n",
    "critic_optimizer = torch.optim.Adam(critic.parameters(), lr=alpha_critic)\n",
    "actor_optimizer = torch.optim.Adam(actor.parameters(), lr=alpha_actor)\n",
    "\n",
    "for j in tqdm(range(nb_episodes)):\n",
    "\n",
    "    accumulated_rewards = 0\n",
    "    state, _ = env.reset()\n",
    "    gamma_k = 1\n",
    "\n",
    "    if j % 200 == 0 and j != 0:\n",
    "        max_episode_len+=100\n",
    "    \n",
    "    for _ in range(max_episode_len):\n",
    "\n",
    "        # YOUR CODE HERE\n",
    "        raise NotImplementedError()\n",
    "        accumulated_rewards += reward\n",
    "\n",
    "        # YOUR CODE HERE\n",
    "        raise NotImplementedError()\n",
    "\n",
    "        if done:\n",
    "            break\n",
    "\n",
    "    if j % 100 == 0 and j != 0:\n",
    "        plt.plot(return_history, label='Return')\n",
    "        plt.plot(pd.Series(return_history).rolling(10).mean(), label='MA')\n",
    "        plt.xlabel('episode')\n",
    "        plt.ylabel('return')\n",
    "        plt.grid(True)\n",
    "        _ = plt.legend()\n",
    "        plt.show()\n",
    "    return_history.append(accumulated_rewards)\n",
    "\n",
    "env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-cff391e6a4178b47",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Plot the learning curve of the training process!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-ed5ed20c7c3d4932",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "plt.plot(return_history, label='Return')\n",
    "plt.plot(pd.Series(return_history, name='reward_history').rolling(10).mean(), label='MA')\n",
    "plt.xlabel('episode')\n",
    "plt.ylabel('return')\n",
    "plt.grid(True)\n",
    "_=plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-41b8edc65ba40233",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## Execution\n",
    "\n",
    "Use `deterministic` to choose between deterministic execution (apply $\\mu$) directly or take the stochastic action by sampling from the normal distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-d7232e344bf16559",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "env = gym.make(\"LunarLander-v3\", continuous=True, render_mode=\"human\")\n",
    "deterministic = True\n",
    "\n",
    "for j in tqdm(range(10)):\n",
    "    state, _ = env.reset()\n",
    "    accumulated_rewards = 0\n",
    "\n",
    "    while True:\n",
    "\n",
    "        with torch.no_grad():\n",
    "\n",
    "            next_state, reward, done, action, mu, sigma = interact(\n",
    "                state=np.reshape(next_state, (1, -1)),\n",
    "                policy=actor,\n",
    "                deterministic=deterministic,\n",
    "                featurizer=featurizer\n",
    "            )\n",
    "\n",
    "        accumulated_rewards += reward\n",
    "        state = next_state\n",
    "\n",
    "        if done:\n",
    "            print(f\"Episode {j}, Total Reward: {accumulated_rewards}\")\n",
    "            break\n",
    "\n",
    "env.close()"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Create Assignment",
  "kernelspec": {
   "display_name": "RL25",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
